Introduction

The goal of this project is to predict customer default in the banking sector by leveraging a dataset containing information on 700 past and 150 prospective bank customers. The dataset encompasses various features such as age, education level, employment duration, address duration, income, debt-related metrics, and the binary default status (Yes or No).

The project aims to explore the relationships between these features and customer defaults, employing both data visualization and statistical modeling techniques. Through the analysis, I intend to provide insights into the factors influencing default rates and develop predictive models to enhance our understanding and forecasting capabilities in the banking context.

Data Preparation

Loading Required Libraries

library(foreign)
library(dplyr)
library(ggplot2)
library(plotly)
library(ggcorrplot)
library(xgboost)
library(margins)

Loading the Data

path <- "D:/Studies/Materials/Second-cycle/I year/I semester/Coding for DS and DM/R/r-project"

data <- read.spss(paste0(path, "/bankloan.sav"), to.data.frame = TRUE) %>%
  select(-matches("preddef"))
head(data)
##   age                           ed employ address income debtinc  creddebt
## 1  41                 Some college     17      12    176     9.3 11.359392
## 2  27 Did not complete high school     10       6     31    17.3  1.362202
## 3  40 Did not complete high school     15      14     55     5.5  0.856075
## 4  41 Did not complete high school     15      14    120     2.9  2.658720
## 5  24           High school degree      2       0     28    17.3  1.787436
## 6  41           High school degree      5       5     25    10.2  0.392700
##    othdebt default
## 1 5.008608     Yes
## 2 4.000798      No
## 3 2.168925      No
## 4 0.821280      No
## 5 3.056564     Yes
## 6 2.157300      No

The bankloan.sav dataset, provided by IBM, contains information on 700 past and 150 prospective bank customers.

The variables of interest include:

age: Age in years,
ed: Level of education,
employ: Number of years with the current employer,
address: Number of years at the current address,
income: Household income in thousands,
debtinc: Debt-to-income ratio in percentage,
creddebt: Credit card debt in thousands,
othdebt: Other debt in thousands,
default: Default status (“Yes” or “No”).

Data Structure

str(data)
## 'data.frame':    850 obs. of  9 variables:
##  $ age     : num  41 27 40 41 24 41 39 43 24 36 ...
##  $ ed      : Factor w/ 5 levels "Did not complete high school",..: 3 1 1 1 2 2 1 1 1 1 ...
##  $ employ  : num  17 10 15 15 2 5 20 12 3 0 ...
##  $ address : num  12 6 14 14 0 5 9 11 4 13 ...
##  $ income  : num  176 31 55 120 28 25 67 38 19 25 ...
##  $ debtinc : num  9.3 17.3 5.5 2.9 17.3 10.2 30.6 3.6 24.4 19.7 ...
##  $ creddebt: num  11.359 1.362 0.856 2.659 1.787 ...
##  $ othdebt : num  5.009 4.001 2.169 0.821 3.057 ...
##  $ default : Factor w/ 2 levels "No","Yes": 2 1 1 1 2 1 1 1 2 1 ...
##  - attr(*, "variable.labels")= Named chr [1:12] "Age in years" "Level of education" "Years with current employer" "Years at current address" ...
##   ..- attr(*, "names")= chr [1:12] "age" "ed" "employ" "address" ...
##  - attr(*, "codepage")= int 65001

The dataset contains 7 numeric and 2 factor variables.

Summary Statistics

summary(data)
##       age                                   ed          employ      
##  Min.   :20.00   Did not complete high school:460   Min.   : 0.000  
##  1st Qu.:29.00   High school degree          :235   1st Qu.: 3.000  
##  Median :34.00   Some college                :101   Median : 7.000  
##  Mean   :35.03   College degree              : 49   Mean   : 8.566  
##  3rd Qu.:41.00   Post-undergraduate degree   :  5   3rd Qu.:13.000  
##  Max.   :56.00                                      Max.   :33.000  
##     address           income          debtinc         creddebt      
##  Min.   : 0.000   Min.   : 13.00   Min.   : 0.10   Min.   : 0.0117  
##  1st Qu.: 3.000   1st Qu.: 24.00   1st Qu.: 5.10   1st Qu.: 0.3822  
##  Median : 7.000   Median : 35.00   Median : 8.70   Median : 0.8851  
##  Mean   : 8.372   Mean   : 46.68   Mean   :10.17   Mean   : 1.5768  
##  3rd Qu.:12.000   3rd Qu.: 55.75   3rd Qu.:13.80   3rd Qu.: 1.8984  
##  Max.   :34.000   Max.   :446.00   Max.   :41.30   Max.   :20.5613  
##     othdebt         default   
##  Min.   : 0.04558   No  :517  
##  1st Qu.: 1.04594   Yes :183  
##  Median : 2.00324   NA's:150  
##  Mean   : 3.07879             
##  3rd Qu.: 3.90300             
##  Max.   :35.19750

Customer age ranges from 20 to 56 years, with a mean of approximately 35.03 years. The majority (460) did not complete high school. Employment duration spans 0 to 33 years, averaging 8.566 years.

Address duration ranges from 0 to 34 years, with a mean of approximately 8.372 years. Income varies from $13,000 to $446,000, averaging $46,680. Debt-to-income ratio ranges from 0.10% to 41.30%, with a mean of 10.17%. Credit card debt ranges from $11.7 to $20,561.3, with an average of $1,576.8. Other debts range from $45.58 to $35,197.50, averaging $3,078.79.

Regarding default status, 517 customers have not defaulted, 183 have defaulted, and 150 missing values refer to prospective customers.

Data Visualization

Selected Density Plots

ggplotly(
  ggplot(data %>% filter(!is.na(default)), # Filter out prospective customers
         aes(x = age, fill = default)) +
    geom_density(alpha = 0.5) +
    labs(title = "Age Distribution by Default Status"))
ggplotly(
  ggplot(data %>% filter(!is.na(default)), # Filter out prospective customers
         aes(x = employ, fill = default)) +
    geom_density(alpha = 0.5) +
    labs(title = "Employment Duration Distribution by Default Status"))
ggplotly(
  ggplot(data %>% filter(!is.na(default)), # Filter out prospective customers
         aes(x = debtinc, fill = default)) +
    geom_density(alpha = 0.5) +
    labs(title = "Debt-to-Income Ratio Distribution by Default Status"))
ggplotly(
  ggplot(data %>% filter(!is.na(default)), # Filter out prospective customers
         aes(x = creddebt, fill = default)) +
    geom_density(alpha = 0.5) +
    labs(title = "Credit Card Debt Distribution by Default Status"))

The visual narrative conveyed by the plots suggests a compelling relationship between lower age and shorter tenure at the current employer, both seemingly linked to a higher likelihood of default. Conversely, a lower debt-to-income ratio emerges as a potential mitigating factor against default risk. Notably, credit card debt stands out as less pivotal and alone may not be a decisive factor in distinguishing between default and non-default cases.

Correlation Heatmap

cormat <- cor(data %>%
                filter(!is.na(default)) %>% # Filter out prospective customers
                select_if(is.numeric)) # Consider only numeric variables
ggcorrplot(cormat, type = "lower", outline.color = "white", lab = TRUE) +
  ggtitle("Correlation Heatmap of Numeric Independent Variables")

The correlation heatmap among numeric independent variables shows modest associations (ranging from -0.03 to 0.63), with none reaching notably high levels. This suggests a lack of significant multicollinearity, supporting the decision to retain all variables for model estimation.

Data Preparation for Estimation

Recoding the Variables

data.1 <- data %>%
  mutate(
    ed_1 = as.numeric(ed == "Did not complete high school"),
    ed_2 = as.numeric(ed == "High school degree"),
    ed_3 = as.numeric(ed == "Some college"),
    ed_4 = as.numeric(ed == "College degree"),
    ed_5 = as.numeric(ed == "Post-undergraduate degree"),
    default_num = as.numeric(default == "Yes")
    )

data.2 <- data.1 %>%
  select(-ed, -ed_1, -default) %>% # Remove "ed_1" to avoid perfect multicollinearity
  filter(!is.na(default_num)) # Filter out prospective customers
head(data.2)
##   age employ address income debtinc  creddebt  othdebt ed_2 ed_3 ed_4 ed_5
## 1  41     17      12    176     9.3 11.359392 5.008608    0    1    0    0
## 2  27     10       6     31    17.3  1.362202 4.000798    0    0    0    0
## 3  40     15      14     55     5.5  0.856075 2.168925    0    0    0    0
## 4  41     15      14    120     2.9  2.658720 0.821280    0    0    0    0
## 5  24      2       0     28    17.3  1.787436 3.056564    1    0    0    0
## 6  41      5       5     25    10.2  0.392700 2.157300    1    0    0    0
##   default_num
## 1           1
## 2           0
## 3           0
## 4           0
## 5           1
## 6           0

Splitting the Data into Training and Test Set

set.seed(123)

indices <- sample(nrow(data.2), size = 0.7*nrow(data.2))

data.train <- data.2[indices, ]
data.test <- data.2[-indices, ]

print(paste0("Size of the training set: ", dim(data.train)[1]))
## [1] "Size of the training set: 489"
print(paste0("Size of the test set: ", dim(data.test)[1]))
## [1] "Size of the test set: 211"

Estimating Logit Model Using Top-Down Approach

Estimating the First Logit Model

fit.full <- glm(default_num ~ ., family = binomial(), data = data.train)
summary(fit.full)
## 
## Call:
## glm(formula = default_num ~ ., family = binomial(), data = data.train)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.417248   0.696333  -2.035 0.041820 *  
## age          0.029300   0.020757   1.412 0.158065    
## employ      -0.234730   0.038944  -6.027 1.67e-09 ***
## address     -0.102565   0.026824  -3.824 0.000132 ***
## income      -0.011210   0.007906  -1.418 0.156205    
## debtinc      0.052413   0.033743   1.553 0.120359    
## creddebt     0.617149   0.129398   4.769 1.85e-06 ***
## othdebt      0.070464   0.087231   0.808 0.419214    
## ed_2         0.658946   0.288566   2.284 0.022400 *  
## ed_3         0.042296   0.436072   0.097 0.922732    
## ed_4         0.251575   0.529975   0.475 0.635006    
## ed_5         1.119653   1.303248   0.859 0.390271    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 564.30  on 488  degrees of freedom
## Residual deviance: 397.68  on 477  degrees of freedom
## AIC: 421.68
## 
## Number of Fisher Scoring iterations: 6

Based on the current analysis, variables such as age, income, debtinc, othdebt, and certain education levels (ed_3, ed_4, ed_5) are not statistically significant at the 0.05 significance level. However, I believe that the variable debtinc is crucial in influencing a customer’s ability to repay a loan. Consequently, the forthcoming logistic regression model will be estimated, excluding all variables deemed statistically insignificant, except for the variable debtinc.

Estimating the Second Logit Model

fit.reduced <- glm(default_num ~ . - age - income - othdebt - ed_3 - ed_4 - ed_5,
                   family = binomial(), data = data.train)
summary(fit.reduced)
## 
## Call:
## glm(formula = default_num ~ . - age - income - othdebt - ed_3 - 
##     ed_4 - ed_5, family = binomial(), data = data.train)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.98692    0.31550  -3.128 0.001759 ** 
## employ      -0.21882    0.03323  -6.585 4.54e-11 ***
## address     -0.08399    0.02296  -3.659 0.000254 ***
## debtinc      0.08053    0.02136   3.769 0.000164 ***
## creddebt     0.52086    0.09784   5.324 1.02e-07 ***
## ed_2         0.56799    0.26463   2.146 0.031844 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 564.30  on 488  degrees of freedom
## Residual deviance: 401.84  on 483  degrees of freedom
## AIC: 413.84
## 
## Number of Fisher Scoring iterations: 6
print(paste0("AIC for the First Logit Model: ", round(AIC(fit.full), 2)))
## [1] "AIC for the First Logit Model: 421.68"
print(paste0("AIC for the Second Logit Model: ", round(AIC(fit.reduced), 2)))
## [1] "AIC for the Second Logit Model: 413.84"

The Akaike information criterion (AIC) for the reduced logistic regression model is lower, indicating that the model performs better with the exclusion of certain variables. Additionally, in this refined model, all variables demonstrate statistical significance at the 0.05 level.

Training XGBoost (Extreme Gradient Boosting)

Data Formatting for Estimation and Validation

X.train <- as.matrix(data.train %>% select(-default_num))
y.train <- data.train$default_num

d.train <- xgb.DMatrix(X.train, label = y.train)

X.test <- as.matrix(data.test %>% select(-default_num))
y.test <- data.test$default_num

Specifying Parameters and Training the Model

params <- list(objective = "binary:logistic", eval_metric = "logloss")
xgb <- xgboost(params = params, data = d.train, nrounds = 50)
## [1]  train-logloss:0.547281 
## [2]  train-logloss:0.459220 
## [3]  train-logloss:0.400979 
## [4]  train-logloss:0.353110 
## [5]  train-logloss:0.313238 
## [6]  train-logloss:0.289052 
## [7]  train-logloss:0.261826 
## [8]  train-logloss:0.239064 
## [9]  train-logloss:0.224665 
## [10] train-logloss:0.214369 
## [11] train-logloss:0.206001 
## [12] train-logloss:0.198469 
## [13] train-logloss:0.182391 
## [14] train-logloss:0.173903 
## [15] train-logloss:0.165850 
## [16] train-logloss:0.160657 
## [17] train-logloss:0.153349 
## [18] train-logloss:0.145432 
## [19] train-logloss:0.139134 
## [20] train-logloss:0.133590 
## [21] train-logloss:0.130590 
## [22] train-logloss:0.125796 
## [23] train-logloss:0.121296 
## [24] train-logloss:0.119171 
## [25] train-logloss:0.116311 
## [26] train-logloss:0.110614 
## [27] train-logloss:0.104667 
## [28] train-logloss:0.102794 
## [29] train-logloss:0.098316 
## [30] train-logloss:0.093869 
## [31] train-logloss:0.090872 
## [32] train-logloss:0.087533 
## [33] train-logloss:0.086606 
## [34] train-logloss:0.082780 
## [35] train-logloss:0.079901 
## [36] train-logloss:0.078798 
## [37] train-logloss:0.077504 
## [38] train-logloss:0.074678 
## [39] train-logloss:0.072353 
## [40] train-logloss:0.070316 
## [41] train-logloss:0.067904 
## [42] train-logloss:0.065669 
## [43] train-logloss:0.063883 
## [44] train-logloss:0.062214 
## [45] train-logloss:0.060305 
## [46] train-logloss:0.058405 
## [47] train-logloss:0.056882 
## [48] train-logloss:0.055465 
## [49] train-logloss:0.054015 
## [50] train-logloss:0.053009

objective = "binary:logistic": This sets the objective function for the XGBoost model to binary logistic regression, indicating that the model is being trained for binary classification (0 or 1 outcomes). eval_metric = "logloss": This specifies the evaluation metric to be used during training. In this case, it is the log-loss, a common metric for classification problems. nrounds = 50: This parameter indicates the number of boosting rounds (iterations) for training. The model will be trained for 50 rounds.

Evaluating Predictive Ability

Creating Confusion Matrices (Test Set)

lr.pred <- predict(fit.reduced, data.test, type = "response")
xgb.pred <- predict(xgb, X.test)

lr.confmat <- table(true = y.test, pred = round(lr.pred))
xgb.confmat <- table(true = y.test, pred = round(xgb.pred))

print("Logit Model Confusion Matrix:")
## [1] "Logit Model Confusion Matrix:"
print(lr.confmat)
##     pred
## true   0   1
##    0 148   9
##    1  31  23
print("XGBoost Confusion Matrix:")
## [1] "XGBoost Confusion Matrix:"
print(xgb.confmat)
##     pred
## true   0   1
##    0 147  10
##    1  32  22

Computing Accuracies (Test Set)

calc_accuracy <- function(confmat) {
  return(sum(diag(confmat))/sum(confmat))
}

acc <- sapply(list(lr.confmat, xgb.confmat), calc_accuracy)

print(paste0("Accuracy of the Logit Model: ", round(acc[1], 4)))
## [1] "Accuracy of the Logit Model: 0.8104"
print(paste0("Accuracy of the XGBoost Model: ", round(acc[2], 4)))
## [1] "Accuracy of the XGBoost Model: 0.8009"

The accuracy of the Logistic Regression model is reported as 0.8104, indicating that the model correctly predicted outcomes for approximately 81.04% of the instances in the test dataset. In comparison, the XGBoost model achieved an accuracy of 0.8009, suggesting it correctly predicted outcomes for approximately 80.09% of the instances, slightly lower than the Logit model.

Given the higher accuracy of the Logistic Regression model compared to XGBoost, the conclusion is to prefer the Logit model for making predictions on this particular dataset.

Saving the Chosen Model to a File

saveRDS(fit.reduced, "fit.reduced.rds")

Interpreting Estimation Results

Computing Average Marginal Effects (AME)

AME <- summary(margins(glm(default_num ~ employ + address + debtinc + creddebt + ed_2,
                           family = binomial(), data = data.train)))
AME
##    factor     AME     SE       z      p   lower   upper
##   address -0.0112 0.0029 -3.8265 0.0001 -0.0169 -0.0055
##  creddebt  0.0693 0.0117  5.9143 0.0000  0.0464  0.0923
##   debtinc  0.0107 0.0027  4.0081 0.0001  0.0055  0.0160
##      ed_2  0.0756 0.0345  2.1895 0.0286  0.0079  0.1433
##    employ -0.0291 0.0037 -7.7916 0.0000 -0.0365 -0.0218

Interpretations:

Each additional year of living at the current address reduces by 0.0112 (on average) the probability that a given person will default on a loan.
Each additional $1000 of credit card debt increases by 0.0693 (on average) the probability that a given person will default on a loan.
Each additional percentage point of debt to income ratio increases by 0.0107 (on average) the probability that a given person will default on a loan.
A person with a high school degree has a higher probability of defaulting on a loan compared to an individual who did not complete high school by 0.0756 (on average).
Each additional year of working with the current employer reduces by 0.0291 (on average) the probability that a given person will default on a loan.

All interpretations are given under the ceteris paribus assumption.

Computing Odds Ratio

oddsratio <- exp(coef(fit.reduced))
oddsratio
## (Intercept)      employ     address     debtinc    creddebt        ed_2 
##   0.3727246   0.8034623   0.9194400   1.0838592   1.6834770   1.7647142

Interpretations:

Each additional year of working with the current employer reduces by 19.65% (on average) the chance that a given person will default on a loan.
Each additional year of living at the current address reduces by 8.06% (on average) the chance that a given person will default on a loan.
Each additional percentage point of debt to income ratio increases by 8.39% (on average) the chance that a given person will default on a loan.
Each additional $1000 of credit card debt increases by 68.35% (on average) the chance that a given person will default on a loan.
A person with a high school degree has a higher chance of defaulting on a loan compared to an individual who did not complete high school by 76.47% (on average).

All interpretations are given under the ceteris paribus assumption.

Forecasting Default of Prospective Customers

Calculating Probabilities of Default for New Customers

newdata <- data.1 %>%
  filter(is.na(default_num)) # Filter out past customers
predict(fit.reduced, newdata, type = "response")
##            1            2            3            4            5            6 
## 0.0119050043 0.0550420559 0.5495592856 0.0769904855 0.3137495993 0.5071960731 
##            7            8            9           10           11           12 
## 0.3087182473 0.8103898499 0.0881504307 0.1099214472 0.0119921106 0.0244792620 
##           13           14           15           16           17           18 
## 0.0037937538 0.0032832309 0.2808662793 0.3029158979 0.9432149171 0.0267555391 
##           19           20           21           22           23           24 
## 0.3624804105 0.0197137690 0.2070791793 0.0416102378 0.0954745068 0.0002040433 
##           25           26           27           28           29           30 
## 0.2409626785 0.1060022436 0.0307582848 0.0897410456 0.0022571181 0.0921404583 
##           31           32           33           34           35           36 
## 0.1161042725 0.0078190458 0.5902502073 0.0336127178 0.0261221378 0.1390711157 
##           37           38           39           40           41           42 
## 0.2828656665 0.0870747638 0.4111603583 0.1938147002 0.5324854517 0.0257144808 
##           43           44           45           46           47           48 
## 0.0015270696 0.0674604814 0.0430344504 0.7650876319 0.1788327110 0.0049805739 
##           49           50           51           52           53           54 
## 0.1062297928 0.0024293667 0.0023051076 0.2250349489 0.0712377423 0.0072830792 
##           55           56           57           58           59           60 
## 0.3662047979 0.0227113025 0.6217402410 0.4022671678 0.3521076868 0.4094913486 
##           61           62           63           64           65           66 
## 0.0525038408 0.1630601778 0.0867669771 0.8496705865 0.0695080132 0.0357077939 
##           67           68           69           70           71           72 
## 0.1375358435 0.1048104095 0.0660165321 0.1639280598 0.6656053215 0.0116275198 
##           73           74           75           76           77           78 
## 0.2616445599 0.6424549859 0.0001744932 0.3849093400 0.4614931960 0.6038249402 
##           79           80           81           82           83           84 
## 0.1688422865 0.0013299229 0.1070778100 0.5997702963 0.3949930049 0.4808007995 
##           85           86           87           88           89           90 
## 0.7443479758 0.1837060177 0.0058206773 0.3568944160 0.0478891540 0.7749034519 
##           91           92           93           94           95           96 
## 0.0779068464 0.2784830770 0.0056742890 0.3254708058 0.0214679766 0.0341424624 
##           97           98           99          100          101          102 
## 0.3764425042 0.4697291368 0.0025570769 0.3889150449 0.3125489935 0.7886344646 
##          103          104          105          106          107          108 
## 0.3175598435 0.9906053912 0.1100970153 0.1106981901 0.0224590251 0.8941601132 
##          109          110          111          112          113          114 
## 0.0609606307 0.1251843631 0.9853058673 0.0128988337 0.0443550470 0.2387587336 
##          115          116          117          118          119          120 
## 0.0220276772 0.3303856563 0.1658973002 0.2242050726 0.3730342845 0.2304217285 
##          121          122          123          124          125          126 
## 0.2872739246 0.1544030297 0.6170547161 0.1703493812 0.2018713659 0.5705938427 
##          127          128          129          130          131          132 
## 0.7472585411 0.0118394777 0.1491414865 0.0297557238 0.0210205280 0.0171486296 
##          133          134          135          136          137          138 
## 0.0129913424 0.1153478148 0.1170132762 0.0125820129 0.9430188128 0.1222569862 
##          139          140          141          142          143          144 
## 0.0105773685 0.2074632426 0.9914460841 0.0494227745 0.0095099863 0.2127732915 
##          145          146          147          148          149          150 
## 0.1536838185 0.0106594749 0.1958628444 0.0290399748 0.3282543609 0.0070464700

Final Remarks

My investigation into predicting customer default in the banking sector has provided valuable insights. Initially, visual cues suggested connections between lower age, shorter employment duration, and a higher likelihood of default, with debt-to-income ratio as a mitigating factor. While statistical analysis confirmed the importance of employment duration and debt-to-income ratio, it debunked the significance of age. Furthermore, it unveiled the importance of variables like credit card debt, years at the current address, and education level.

This emphasizes the shift from visual intuition to statistical rigor. While data visualization provides initial insights, statistical analysis plays a crucial role in navigating the complexities of predictive modeling.